Parallelizing Word2Vec in Shared and Distributed Memory
Authors
Abstract
Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It recently generated considerable excitement in the machine learning and natural language processing (NLP) communities due to its exceptional performance in many NLP applications such as named entity recognition, sentiment analysis, machine translation and question answering. State-of-the-art algorithms, including those by Mikolov et al., have been parallelized for multi-core CPU architectures, but they are based on vector-vector operations that are memory-bandwidth intensive and do not efficiently use computational resources. In this work, we improve reuse of various data structures in the algorithm through the use of minibatching, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to parallelize word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. In combination, these techniques allow us to scale up the computation nearly linearly across cores and nodes, and process hundreds of millions of words per second, making this, to the best of our knowledge, the fastest word2vec implementation.
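To make the minibatching idea in the abstract concrete, the sketch below shows, in NumPy, how sharing one set of negative samples across a minibatch of input words turns the per-pair dot products of skip-gram negative sampling into small dense matrix multiplies. This is a minimal illustration under assumed names and sizes (W_in, W_out, B, K, learning rate), not the authors' implementation.

```python
# Minimal sketch: batched skip-gram negative-sampling (SGNS) update where a
# minibatch of input words shares a single target word and a single set of
# negative samples, so scores and gradients become matrix multiplies
# (BLAS level 3) instead of many vector-vector operations.
# All names and hyperparameters here are illustrative assumptions.
import numpy as np

def sgns_minibatch_update(W_in, W_out, input_ids, target_id, neg_ids, lr=0.025):
    X = W_in[input_ids]                     # (B, D) minibatch of input vectors
    C = W_out[np.r_[target_id, neg_ids]]    # (1+K, D) target + shared negatives

    labels = np.zeros((len(input_ids), len(neg_ids) + 1))
    labels[:, 0] = 1.0                      # first column is the positive pair

    scores = X @ C.T                        # (B, 1+K): one GEMM replaces B*(1+K) dot products
    probs = 1.0 / (1.0 + np.exp(-scores))   # sigmoid
    err = lr * (labels - probs)             # (B, 1+K) scaled gradient coefficients

    # Gradient updates are again matrix multiplies.
    # (Duplicate indices within a minibatch are ignored here for simplicity.)
    W_in[input_ids] += err @ C                      # (B, D)
    W_out[np.r_[target_id, neg_ids]] += err.T @ X   # (1+K, D)

# Toy usage with random data.
rng = np.random.default_rng(0)
V, D, B, K = 1000, 100, 16, 5               # vocab, dim, minibatch size, negatives
W_in = rng.normal(scale=0.01, size=(V, D))
W_out = np.zeros((V, D))
sgns_minibatch_update(W_in, W_out,
                      input_ids=rng.integers(0, V, size=B),
                      target_id=int(rng.integers(0, V)),
                      neg_ids=rng.integers(0, V, size=K))
```

Because the score and gradient computations are expressed as dense matrix products, they map onto optimized GEMM kernels and reuse the shared output vectors across the whole minibatch, which is the data-reuse benefit the abstract describes.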
Similar Resources
Experiments with Parallelizing a Tribology Application
Different parallelization methods vary in their system requirements, programming styles, efficiency of exploring parallelism, and the application characteristics they can handle. Different applications can exhibit totally different performance gains depending on the parallelization method used. This paper compares OpenMP, MPI, and Strings (a distributed shared memory system) for parallelizing a compli...
Communication Analysis for Shared and Distributed Memory Machines
Advances in programming languages and parallelizing compilers are making parallel computers easier to use by providing a high-level portable programming model that protects software investment. However, experience has shown that simply finding parallelism is not always sufficient for obtaining good performance from today’s multiprocessors, largely because the cost of interprocessor communicatio...
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of port...
Automatic Localization for Distributed-Memory Multiprocessors Using a Shared-Memory Compilation Framework
In this paper, we outline an approach for compiling for distributed-memory multiprocessors that is inherited from compiler technologies for shared-memory multiprocessors. We believe that this approach to compiling for distributed-memory machines is promising because it is a logical extension of the shared-memory parallel programming model, a model that is easier for programmers to work with, an...
Journal title: CoRR
Volume: abs/1604.04661
Pages: -
Publication date: 2016